I. INTRODUCTION

In the development of concurrent algorithms for vector
multiprocessors  (VMP),  a  major  distinction  exists  between  small-  and 
large-grain  tasking.  The  latter  is  implemented  at  a  high  level  of 
program  logic  (e.g.,  a  physical  space  is  divided  into  subspaces,  each 
involving  significant  computation)  from  a  high-level  language,  and  will 
achieve  a  speedup  nearly  equal  to  p,  the  number  of  processors,  for  a 
variety  of  applications  and  for  any  VMP  worthy  of  the  name.  Small- 
grain  tasking  is  implemented  at  a  low  level  to  achieve  similar  speedups 
with  more  but  smaller  tasks.  When  this  task  size  is  at  the  vector 
operation level, it has been termed "microtasking" by this author [1].
Although early implementations of small-grain tasking were in assembly
language on the Cray X-MP [1], Cray Research now believes that system
software can be written with sufficiently small tasking overhead that
similar performance can be achieved from Fortran. Even more important,
a code such as the matrix-vector multiply


      DO 1 I = 1, N2
      DO 1 J = 1, N1
    1 X(J) = X(J) + M(J,I)*Y(I)


could be automatically microtasked by a compiler by, for example,
designating a task to be simply M(J,I)*Y(I) for all J. Such
auto-tasking would not be possible in the large-grain case, where global
(likely problem-dependent) data dependencies are a consideration.
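The task decomposition described above can be sketched as follows. This is an illustrative sketch in Python rather than Fortran or X-MP assembly, and it is not the author's implementation: the function names, the thread pool, and the `workers` parameter are all assumptions introduced here. Each task computes M(J,I)*Y(I) for all J for one column I; the partial vectors are then accumulated into X in column order so that the result matches the serial loop.

```python
from concurrent.futures import ThreadPoolExecutor


def column_task(M, y, i):
    """One microtask: the vector operation M(J,I)*Y(I) for all J."""
    yi = y[i]
    return [row[i] * yi for row in M]


def microtasked_matvec(M, y, x, workers=4):
    """Return x + M*y using one task per column I of M (hypothetical helper)."""
    n1, n2 = len(M), len(y)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in column order, whatever order tasks finish in.
        partials = list(pool.map(lambda i: column_task(M, y, i), range(n2)))
    result = list(x)
    # Accumulate serially: every task contributes to the same vector X.
    for col in partials:
        for j in range(n1):
            result[j] += col[j]
    return result


def serial_matvec(M, y, x):
    """Reference version: a direct transcription of the DO loops."""
    result = list(x)
    for i in range(len(y)):
        for j in range(len(M)):
            result[j] += M[j][i] * y[i]
    return result
```

Note that the accumulation into X is kept outside the tasks. Since every column updates the same X(J), a compiler that microtasks this loop must either serialize the additions or give each processor its own partial sum.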

In  summary,  automatic  small-grain  tasking  offers  the  ultimate  in 
user  convenience,  with  possibly  modest  loss  in  efficiency.  It  is  clear 
that, where appropriate, small-grain algorithm studies on current VMPs
should continue, although current implementations are in assembly
language. These studies will challenge VMP architects to offer hardware
to support such tasking (like the X-MP), rather than assume that only
large-grain tasking need be supported (like the Cray-2). We may suffer
from the latter decision for many years.


II.  PROGRESS  REPORT 

A.  Introduction 

The  availability  of  instruction-level  simulators  for  the  X-MP  and 
the  Cray-2  is  making  possible  the  study  of  small-grain  issues  for  many- 
processor  VMP  configurations.  These  include 

(a)  microtasking  as  indicated  above,  and 

(b)  designing  conflict-resistant  algorithms. 

B. Hybrid-tasked algorithms

Studies  of  microtasking  with  up  to  16  X-MP  processors  for  LU 
decomposition  of  dense  systems  of  equations  have  given  rise  to  hybrid 
algorithms  of  the  form  of  Figure  1.  Here,  a  microtasked  decoupling 
step  is  performed  by  all  processors,  to  produce  large-grain  tasks  which 
carry out the bulk of the computation. The algorithm was originally
simulated for a (hypothetical) architecture of Cray-1s connected to a
common memory [1]; a follow-up study has since been made for a similar
configuration of X-MP processors. The efficiency of the resultant
algorithm is shown in Figure 2. An associated paper has been accepted
for publication [2], and a similar study is being prepared for the
Cray-2. One reviewer pointed out that such examples may be considered
as characterizing both the small- and large-grain attributes of an
architecture in a single application. Of course, the computation model
of Figure 1 has merit in its own right as one method of approaching
problems not susceptible to large-grain tasking alone.

C.  Conflict-resistant  algorithms 


Both the X-MP and the Cray-2 have a design conflict between the
need to conserve physical and wiring space (which translates into
speed) by minimizing the number of memory banks, and the need to
isolate computations of processors connected to common memory and
sharing memory banks. This results in an X-MP design which has a
marginal conflict problem for 4 processors, and a serious degradation
if 16 processors are used with a proportionately large number of banks.
(The latter conclusion was a byproduct of the 16-processor algorithm
studies above.) This result was considered of national significance,
and an associated study was commissioned by Los Alamos National
Laboratory [5].
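The way shared banks couple otherwise independent processors can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not a model of the X-MP or Cray-2 memory system: the function name, the fixed-priority arbitration, and the parameters `bank_busy` and `starts` are hypothetical. Each processor reads consecutive banks, one access per clock when not stalled, and a bank that accepts an access stays busy for `bank_busy` clocks.

```python
def simulate_bank_conflicts(n_procs, n_banks, bank_busy, n_accesses, starts):
    """Count stall clocks in a toy shared-memory model (hypothetical).

    starts[p] is the first bank touched by processor p.
    """
    free_at = [0] * n_banks            # first clock at which each bank is free
    next_bank = list(starts)           # next bank index wanted by each processor
    remaining = [n_accesses] * n_procs
    stalls = 0
    clock = 0
    while any(remaining):
        for p in range(n_procs):       # fixed priority: lower index wins
            if remaining[p] == 0:
                continue
            b = next_bank[p] % n_banks
            if free_at[b] <= clock:
                free_at[b] = clock + bank_busy   # bank accepts the access
                next_bank[p] += 1
                remaining[p] -= 1
            else:
                stalls += 1            # bank still busy: processor waits a clock
        clock += 1
    return stalls
```

In this model a single processor sweeping 4 banks with a 4-clock busy time never stalls, while a second processor starting on the same bank is held up until the first has moved on. The stall count is the "total algorithm delay" that a conflict study would measure.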

This  also  gave  rise  to  a  diversion  of  our  AFOSR  grant  resources 
into  the  new  question  of  the  effects  of  bank  conflicts  -  expressed  by 
access  delays  -  on  algorithm  performance.  (Early  experience  with  the 
Cray-2 indicates this may be an overriding consideration.) An algorithm
conflict sensitivity has been defined in [4] as

    S = (total algorithm delay, in clocks) / (average access delay, in clocks)

which measures this effect. To reduce this ratio, it has been shown
possible to assembly-code the X-MP so that accesses are prefetched
into vector registers; ordinary delays in an access then do not affect
the later use of this data by the algorithm. Codes written this way are
examples of conflict-resistant algorithms. This concept has particular
significance for library codes, which are often assembly-coded for
speed; their incremental conflict sensitivity can often be reduced to
zero. An associated report has been prepared [4].
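As a worked illustration of the ratio S, the sketch below evaluates it under a deliberately simple model. Only the definition of S comes from the text; the delay model, the `slack` parameter (clocks between a prefetch and the first use of the data), and the numbers are assumptions made for illustration and are not taken from [4].

```python
def conflict_sensitivity(total_delay_clocks, avg_access_delay_clocks):
    """S = total algorithm delay / average access delay (both in clocks)."""
    if avg_access_delay_clocks <= 0:
        raise ValueError("average access delay must be positive")
    return total_delay_clocks / avg_access_delay_clocks


def total_delay(n_accesses, access_delay, slack=0):
    """Toy model (assumption): an access stalls the algorithm only for the
    part of its delay that exceeds the prefetch slack."""
    return n_accesses * max(0, access_delay - slack)


# Hypothetical numbers: 1000 vector accesses, each delayed by 3 clocks.
n, d = 1000, 3
s_direct = conflict_sensitivity(total_delay(n, d), d)              # no prefetch
s_prefetch = conflict_sensitivity(total_delay(n, d, slack=8), d)   # 8 clocks of slack
```

Under these assumptions the direct code has S = 1000, since every delayed access stalls the algorithm, while the prefetched code has S = 0 because each delay is absorbed before the data is needed.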

III.  COUPLING  ACTIVITIES  (WITH  NATIONAL  LABORATORIES) 
E. Research Center for Advanced Computer Science (RIACS,
Consultant, assisting in the conversion of Computational
Chemistry  codes  to  the  Cray-2,  and  the  study  of  Cray-2  MP 
algorithms. 
Figure 1. Hybrid granularity computational model


Figure  2.  Performance  of  hybrid  code 